So far, we have seen several data formats (CSV/TSV, JSON). This week, we will learn how to work with one of the most popular structured data formats: XML. XML is used a lot in NLP, so it is important that you know how to work with it.
- etree.parse, etree.fromstring, and etree.tostring
- use the following methods and attributes of an XML element (of type lxml.etree._Element):
  - the methods find, findall, and getchildren
  - the method get
  - the attributes tag and text
- [not needed for assignment] create your own XML and write it to a file
If you have questions about this chapter, please refer to the forum on Canvas.
NLP is all about data. More specifically, we usually want to annotate (manually or automatically) textual data with information about:
What would data look like that contains all this information? Let's look at a simple example:
In [ ]:
import nltk
In [ ]:
text = nltk.word_tokenize("Tom Cruise is an actor.")
print(nltk.pos_tag(text))
In this example, we see that the format is a list of tuples. The first element of each tuple is the word and the second element is the part of speech tag. Great, so far this works. However, we also want to indicate that Tom Cruise is an entity. Now, we start to run into trouble, because some annotations are for single words and some are for combinations of words. In addition, sometimes we have more than one annotation per token. Data structures such as CSV and TSV are not great at representing linguistic information. So is there a format that is better at it? The answer is yes and the format is XML.
Let's look at an example (the line numbers are there for explanation purposes). On purpose, we start with a non-linguistic, hopefully intuitive example. In the folder ../Data/xml_data this XML is stored as the file course.xml. You can inspect this file using a text editor (e.g. Atom, BBEdit or Notepad++).
1. <Course>
2. <person role="coordinator">Van der Vliet</person>
3. <person role="instructor">Van Miltenburg</person>
4. <person role="instructor">Van Son</person>
5. <person role="instructor">Postma</person>
6. <person role="instructor">Sommerauer</person>
7. <person role="student">Baloche</person>
8. <person role="student">De Boer</person>
9. <animal role="student">Rubber duck</animal>
10. <person role="student">Van Doorn</person>
11. <person role="student">De Jager</person>
12. <person role="student">King</person>
13. <person role="student">Kingham</person>
14. <person role="student">Mózes</person>
15. <person role="student">Rübsaam</person>
16. <person role="student">Torsi</person>
17. <person role="student">Witteman</person>
18. <person role="student">Wouterse</person>
19. <person/>
20. </Course>
Lines 1 to 19 all show examples of XML elements. Each XML element contains a start tag (e.g. <person>) and an end tag (e.g. </person>). An element can contain:
- text, e.g. Van der Vliet on line 2
- attributes, e.g. role on line 2
- other elements, e.g. Course contains the person elements, which makes person the child of Course and Course the parent of person.
Please note that on line 19 the start tag and end tag are combined. This happens when an element has no children and no text. The syntax for such an element is <START_TAG/>.
A special element is the root element. In our example, Course is our root element. The element starts at line 1 (<Course>) and ends at line 20 (</Course>). Notice the difference between the start tag (no '/') and the end tag (with '/'). The root element is special in that it is the single outermost element, which contains all the other elements.
Elements can contain attributes, which contain information about the element. In this case, this information is the role a person has in the course. All attributes are located in the start tag of an XML element.
Now that we know the basics of XML, we want to be able to access it in Python. In order to work with XML, we will use the lxml library.
In [ ]:
from lxml import etree
We will focus on the following methods/attributes:
- etree.parse() and etree.fromstring()
- getroot()
- find(), findall(), and getchildren()
- get()
- tag and text
In [ ]:
xml_string = """
<Course>
<person role="coordinator">Van der Vliet</person>
<person role="instructor">Van Miltenburg</person>
<person role="instructor">Van Son</person>
<person role="instructor">Marten Postma</person>
<person role="student">Baloche</person>
<person role="student">De Boer</person>
<animal role="student">Rubber duck</animal>
<person role="student">Van Doorn</person>
<person role="student">De Jager</person>
<person role="student">King</person>
<person role="student">Kingham</person>
<person role="student">Mózes</person>
<person role="student">Rübsaam</person>
<person role="student">Torsi</person>
<person role="student">Witteman</person>
<person role="student">Wouterse</person>
<person/>
</Course>
"""
tree = etree.fromstring(xml_string)
print(type(tree))
The etree.parse() method is used to load XML files on your computer:
In [ ]:
tree = etree.parse('../Data/xml_data/course.xml')
print(type(tree))
As you can see, etree.parse() returns an ElementTree, whereas etree.fromstring() returns an Element. One of the important differences is that the ElementTree class serialises as a complete document, as opposed to a single Element. This includes top-level processing instructions and comments, as well as a DOCTYPE and other DTD content in the document. For now, it's not too important that you know what these are; just remember that there is a difference between ElementTree and Element.
While etree.fromstring() gives you the root element right away, etree.parse() does not. In order to access the root element of ElementTree, we first need to use the getroot() method. Note that this does not show the XML element itself, but only a reference. In order to show the element itself, we can use the etree.dump() method.
In [ ]:
root = tree.getroot()
print('root', type(root), root)
print()
print('etree.dump example')
etree.dump(root, pretty_print=True)
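The difference between ElementTree and Element described above can be made visible with a small sketch. It assumes a made-up XML snippet with a top-level comment (not the course.xml file) and loads it from memory with io.BytesIO instead of from disk. Serialising the tree keeps the comment, while serialising only the root element drops it.

```python
import io
from lxml import etree

# Hypothetical document: a comment placed before the root element
xml_bytes = b"<!-- list of participants --><Course><person role='student'>King</person></Course>"

# etree.parse() also accepts file-like objects, so no file on disk is needed
tree = etree.parse(io.BytesIO(xml_bytes))
root = tree.getroot()

# The ElementTree serialises as a complete document (including the top-level comment)
tree_output = etree.tostring(tree)
# The Element serialises only itself and its children
root_output = etree.tostring(root)

print(type(tree).__name__, tree_output)
print(type(root).__name__, root_output)
```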
As with any Python object, we can use the built-in function dir() to list all methods and attributes of an element (which has the type lxml.etree._Element), some of which will be illustrated below.
In [ ]:
print(type(root))
dir(root)
In [ ]:
first_person_el = root.find('person')
etree.dump(first_person_el, pretty_print=True)
In order to get a list of all person children, we can use the findall() method.
Notice that this does not return the animal since we are looking for person elements.
In [ ]:
all_person_els = root.findall('person')
all_person_els
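findall() can also select on attributes. The sketch below assumes a shortened, made-up version of the course XML and uses the path predicate [@role='student'], which lxml's find()/findall() path syntax supports, to keep only the person elements with that role.

```python
from lxml import etree

# Shortened, hypothetical version of the course XML
course = etree.fromstring("""
<Course>
  <person role="coordinator">Van der Vliet</person>
  <person role="instructor">Van Son</person>
  <person role="student">King</person>
  <animal role="student">Rubber duck</animal>
  <person/>
</Course>
""")

# Keep only person elements whose role attribute equals 'student'
student_els = course.findall("person[@role='student']")
student_names = [el.text for el in student_els]
print(student_names)
```

The animal element is not returned, because the tag in the path is still person.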
Sometimes, we simply want all the children, regardless of their tags. This can be achieved using the getchildren() method, which returns all direct children of an element.
Now we do get the animal element again.
In [ ]:
all_child_els = root.getchildren()
all_child_els
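The lxml documentation marks getchildren() as deprecated and suggests list(element) or plain iteration instead. A minimal sketch of that alternative, assuming a shortened, made-up version of the course XML:

```python
from lxml import etree

# Shortened, hypothetical version of the course XML
course = etree.fromstring(
    '<Course><person role="student">King</person>'
    '<animal role="student">Rubber duck</animal><person/></Course>'
)

# An element behaves like a sequence of its direct children
children = list(course)
child_tags = [child.tag for child in course]

print(len(course))   # number of direct children
print(child_tags)
```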
The get() method is used to access the attribute of an element.
If an attribute does not exist, get() returns None instead of raising an error.
In [ ]:
first_person_el = root.find('person')
role_first_person_el = first_person_el.get('role')
attribute_not_found = first_person_el.get('blabla')
print('role first person element:', role_first_person_el)
print('value if not found:', attribute_not_found)
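As with dictionaries, get() accepts an optional second argument that is returned when the attribute is missing, and the attrib property gives access to all attributes at once. A small sketch on a single, stand-alone person element (the fallback value 'unknown' is just an illustration):

```python
from lxml import etree

person_el = etree.fromstring('<person role="coordinator">Van der Vliet</person>')

role = person_el.get('role')                  # existing attribute
missing = person_el.get('blabla', 'unknown')  # fallback instead of None
all_attributes = dict(person_el.attrib)       # copy of all attributes as a plain dict

print(role, missing, all_attributes)
```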
The text of an element is found in the attribute text:
In [ ]:
print(first_person_el.text)
The tag of an element is found in the attribute tag:
In [ ]:
print(first_person_el.tag)
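The methods and attributes introduced so far are often combined. The following sketch, again on a shortened, made-up version of the course XML, groups the tag and text of each child by its role and skips empty elements such as <person/>.

```python
from collections import defaultdict
from lxml import etree

# Shortened, hypothetical version of the course XML
course = etree.fromstring("""
<Course>
  <person role="coordinator">Van der Vliet</person>
  <person role="student">King</person>
  <animal role="student">Rubber duck</animal>
  <person/>
</Course>
""")

names_by_role = defaultdict(list)
for child in course:
    role = child.get('role')
    if role is None or child.text is None:
        continue  # skip elements without a role or without text
    names_by_role[role].append((child.tag, child.text))

for role, members in names_by_role.items():
    print(role, members)
```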
<NAF xml:lang="en" version="v3">
<terms>
<term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
<term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
<term id="t3" type="open" lemma="be" pos="V" morphofeat="VBZ"/>
<term id="t4" type="open" lemma="an" pos="R" morphofeat="DT"/>
<term id="t5" type="open" lemma="actor" pos="N" morphofeat="NN"/>
</terms>
<entities>
<entity id="e3" type="PERSON">
<references>
<span>
<target id="t1" />
<target id="t2" />
</span>
</references>
</entity>
</entities>
</NAF>
Again, we use etree.fromstring() to load XML from a string:
In [ ]:
naf_string = """
<NAF xml:lang="en" version="v3">
<text>
<wf id="w1" offset="0" length="3" sent="1" para="1">tom</wf>
<wf id="w2" offset="4" length="6" sent="1" para="1">cruise</wf>
<wf id="w3" offset="11" length="2" sent="1" para="1">is</wf>
<wf id="w4" offset="14" length="2" sent="1" para="1">an</wf>
<wf id="w5" offset="17" length="5" sent="1" para="1">actor</wf>
</text>
<terms>
<term id="t1" type="open" lemma="Tom" pos="N" morphofeat="NNP"/>
<term id="t2" type="open" lemma="Cruise" pos="N" morphofeat="NNP"/>
<term id="t3" type="open" lemma="be" pos="V" morphofeat="VBZ"/>
<term id="t4" type="open" lemma="an" pos="R" morphofeat="DT"/>
<term id="t5" type="open" lemma="actor" pos="N" morphofeat="NN"/>
</terms>
<entities>
<entity id="e3" type="PERSON">
<references>
<span>
<target id="t1" />
<target id="t2" />
</span>
</references>
</entity>
</entities>
</NAF>
"""
naf = etree.fromstring(naf_string)
print(type(naf))
etree.dump(naf, pretty_print=True)
Please note that the structure is as follows:
- The NAF element is the parent of the elements text, terms, and entities.
- The wf elements are children of the text element. They provide information about the position of words in the text, e.g. that tom is the first word in the text (id="w1") and in the first sentence (sent="1").
- The term elements are children of the terms element. They provide information about lemmatization and part of speech.
- The entity element is a child of the entities element. We learn from the entity element that the terms t1 and t2 (i.e. Tom Cruise) form an entity of type person.
One way of accessing the first target element is by going one level at a time:
In [ ]:
entities_el = naf.find('entities')
entity_el = entities_el.find('entity')
references_el = entity_el.find('references')
span_el = references_el.find('span')
target_el = span_el.find('target')
etree.dump(target_el, pretty_print=True)
Is there a better way? Yes: find() also accepts a path of tags separated by slashes, which takes us to our target element in one step:
In [ ]:
target_el = naf.find('entities/entity/references/span/target')
etree.dump(target_el, pretty_print=True)
You can also use findall() to find all target elements:
In [ ]:
for target_el in naf.findall('entities/entity/references/span/target'):
    etree.dump(target_el, pretty_print=True)
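The target elements only contain term identifiers. To recover which words an entity consists of, we can look those identifiers up in the terms layer. The sketch below assumes a trimmed-down NAF string (fewer attributes and terms than the example above) and first builds a dictionary from term id to lemma.

```python
from lxml import etree

# Trimmed-down, hypothetical NAF document
naf = etree.fromstring("""
<NAF xml:lang="en" version="v3">
  <terms>
    <term id="t1" lemma="Tom" pos="N"/>
    <term id="t2" lemma="Cruise" pos="N"/>
    <term id="t3" lemma="be" pos="V"/>
  </terms>
  <entities>
    <entity id="e3" type="PERSON">
      <references><span><target id="t1"/><target id="t2"/></span></references>
    </entity>
  </entities>
</NAF>
""")

# Map each term id to its lemma
lemma_by_term_id = {term.get('id'): term.get('lemma')
                    for term in naf.findall('terms/term')}

# For each entity, collect the lemmas of the terms it points to
entities = []
for entity_el in naf.findall('entities/entity'):
    target_ids = [target.get('id')
                  for target in entity_el.findall('references/span/target')]
    lemmas = [lemma_by_term_id[term_id] for term_id in target_ids]
    entities.append((entity_el.get('type'), ' '.join(lemmas)))

print(entities)
```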
Please note that this section is optional, meaning that you don't need to understand this section in order to complete the assignment.
There are three main steps:
You create a new XML object by:
- creating the root element, using etree.Element
- creating the tree from the root element, using etree.ElementTree
You do not have to fully understand how this works. Please make sure you can reuse this code snippet when you create your own XML.
In [ ]:
our_root = etree.Element('Course')
our_tree = etree.ElementTree(our_root)
We can inspect what we have created by using the etree.dump() method. As you can see, we only have the root node Course currently in our document.
In [ ]:
etree.dump(our_root, pretty_print=True)
Next, we add child elements to the root. One way is to create a new element with etree.Element() and then append it to the root:
In [ ]:
# Define tag, attributes and text of the new element
tag = 'person' # what the start and end tag will be
attributes = {'role': 'student'} # dictionary of attributes, can be more than one
name_student = 'Lee' # the text of the elements
# Create new Element
new_person_element = etree.Element(tag, attrib=attributes)
new_person_element.text = name_student
# Add to root
our_root.append(new_person_element)
# Inspect the current XML
etree.dump(our_root, pretty_print=True)
However, this is so common that there is a shorter and much more efficient way to do this: by using etree.SubElement(). It accepts the same arguments as the etree.Element() method, but additionally requires the parent as first argument:
In [ ]:
# Define tag, attributes and text of the new element
tag = 'person'
attributes = {'role': 'student'}
name_student = 'Pitt'
# Add to root
another_person_element = etree.SubElement(our_root, tag, attrib=attributes) # parent is our_root
another_person_element.text = name_student
# Inspect the current XML
etree.dump(our_root, pretty_print=True)
As we have seen before, XML can have multiple nested layers. Creating these works the same way as adding child elements to the root, but now we specify one of the other elements as the parent (in this case, new_person_element).
In [ ]:
# Define tag, attributes and text of the new element
tag = 'pet'
attributes = {'role': 'joy'}
name_pet = 'Romeo'
# Add to new_person_element
new_pet_element = etree.SubElement(new_person_element, tag, attrib=attributes) # parent is new_person_element
new_pet_element.text = name_pet
# Inspect the current XML
etree.dump(our_root, pretty_print=True)
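The same building blocks can turn a list of (word, tag) tuples, like the part-of-speech output at the start of this chapter, into XML. This is only a sketch: the tagged tokens are written by hand rather than produced by NLTK, and the element names sentence and token and the attribute names id and pos are made up for this illustration (they are not NAF).

```python
from lxml import etree

# Hand-written list in the (word, tag) format; not actual tagger output
tagged_tokens = [('Tom', 'NNP'), ('Cruise', 'NNP'), ('is', 'VBZ'),
                 ('an', 'DT'), ('actor', 'NN'), ('.', '.')]

# Hypothetical element and attribute names
sentence_el = etree.Element('sentence')
for index, (word, pos_tag) in enumerate(tagged_tokens, start=1):
    token_el = etree.SubElement(sentence_el, 'token',
                                attrib={'id': f't{index}', 'pos': pos_tag})
    token_el.text = word

etree.dump(sentence_el, pretty_print=True)
```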
In [ ]:
with open('../Data/xml_data/selfmade.xml', 'wb') as outfile:
    our_tree.write(outfile,
                   pretty_print=True,
                   xml_declaration=True,
                   encoding='utf-8')
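If you want the XML as a string instead of a file, etree.tostring() can be used. The sketch below rebuilds a small tree of its own, shows both the bytes and the str variant, and then checks a round trip by writing to a temporary directory (an assumption made here so that the example does not touch ../Data/xml_data) and parsing the file again.

```python
import os
import tempfile
from lxml import etree

# Build a small tree, as in the cells above
root = etree.Element('Course')
student_el = etree.SubElement(root, 'person', attrib={'role': 'student'})
student_el.text = 'Lee'
tree = etree.ElementTree(root)

# tostring() returns bytes by default; encoding='unicode' returns a str
xml_bytes = etree.tostring(root, pretty_print=True, encoding='utf-8')
xml_text = etree.tostring(root, encoding='unicode')
print(xml_text)

# Round trip: write to a temporary file and load it again
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'selfmade.xml')
    tree.write(path, pretty_print=True, xml_declaration=True, encoding='utf-8')
    reloaded_root = etree.parse(path).getroot()
    reloaded_name = reloaded_root.find('person').text

print(reloaded_name)
```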
In [ ]:
xml_string = """
<Course>
<person role="coordinator">Van der Vliet</person>
<person role="instructor">Van Miltenburg</person>
<person role="instructor">Van Son</person>
<person role="instructor">Marten Postma</person>
<person role="student">Baloche</person>
<person role="student">De Boer</person>
<animal role="student">Rubber duck</animal>
<person role="student">Van Doorn</person>
<person role="student">De Jager</person>
<person role="student">King</person>
<person role="student">Kingham</person>
<person role="student">Mózes</person>
<person role="student">Rübsaam</person>
<person role="student">Torsi</person>
<person role="student">Witteman</person>
<person role="student">Wouterse</person>
<person/>
</Course>
"""
tree = etree.fromstring(xml_string)
print(type(tree))
In the folder ../Data/xml_data there is an XML file called framenet.xml, which is a simplified version of the data provided by the FrameNet project.
FrameNet is a lexical database describing semantic frames, which are representations of events or situations and the participants in it. For example, cooking typically involves a person doing the cooking (Cook), the food that is to be cooked (Food), something to hold the food while cooking (Container) and a source of heat (Heating_instrument). In FrameNet, this is represented as a frame called Apply_heat. The Cook, Food, Heating_instrument and Container are called frame elements (FEs). Words that evoke this frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat frame. FrameNet also contains relations between frames. For example, Apply_heat has relations with the Absorb_heat, Cooking_creation and Intentionally_affect frames. In FrameNet, frame descriptions are stored in XML format.
framenet.xml contains the information about the frame Waking_up. Parse the XML file and print the following:
Waking_up (e.g. Event with the Inherits from relation)